Identifying Moving Objects in a Video Using Volume Growing and Change Detection Masks

Field of the Invention 

The invention relates generally to video processing, and in particular, to identifying 
moving objects in a video. 

Background of the Invention 

Many videos require processing to find objects, determine events, quantify
application-dependent visual assessments, as shown by the recent MPEG-4 and
MPEG-7 standardization efforts, and to analyze characteristics of video sequences,
see, e.g., R. Castagno, T. Ebrahimi, and M. Kunt, "Video segmentation based on
multiple features for interactive multimedia applications," IEEE Trans. on Circuits
and Systems for Video Technology, Vol. 8, No. 5, pp. 562-571, September 1998.
Content-based video representation requires the decomposition of an image or video
sequence into specific objects, e.g., separating moving persons from static
backgrounds.

Many television broadcasts contain scenes where a person is speaking in front of a
relatively static background, e.g., news programs, panel shows, biographies, soap
operas, etc. Also, video-conference applications extensively use head-and-shoulder
scenes to achieve visual communication. The increasing availability of mobile video
cameras will make peer-to-peer, bandwidth-constrained facial communication
prevalent in the future. Thus, accurate object segmentation of head-and-shoulder type
video sequences, also known as "talking heads," is an important aspect of video
processing.

However, automatic segmentation of head-and-shoulder type sequences is difficult.
Parameter-based methods cannot accurately estimate the motion of an object in that
type of sequence, because usually, a talking head sitting at a desk exhibits minimal
motion. Moreover, motion-based segmentation methods are computationally
expensive and unreliable. Region-based methods have disadvantages such as
over-segmentation and can fail to determine a region-of-interest. Frame-difference-based
methods suffer from inaccurate object shape determinations.

Another method for object segmentation utilizes volume growing to obtain the
smallest color consistent components of a video, see, e.g., F. Porikli and Y. Wang,
"An unsupervised multi-resolution object extraction algorithm using video-cube,"
Proceedings of Int. Conf. Image Process, Thessaloniki, 2001, see also, U.S. Patent
Application 09/826,333 "Method for Segmenting Multi-Resolution Video Objects"
filed by Porikli et al. on April 4, 2001. First, a fast median filter is applied to the video to
remove local color irregularities, see, e.g., M. Kopp and W. Purgathofer, "Efficient
3x3 median filter computations," Technical University, Vienna, 1994. Then, a
spatio-temporal data structure is formed from the input video sequence by indexing the
image frames and their features. The object information can be propagated forward
as well as backward in time by treating consecutive video frames as the planes of
a 3D data structure. After a video sequence is filtered, marker points are selected by
color gradient. A volume around each marker is grown using color distance. The
problem with video volumes is that moving objects are indistinguishable from static
objects. For example, with volume growing, a blank wall of a distinct color will form
a volume.

A change detection mask (CDM) is a map of pixels that change between the previous
and the current frame of a pair of frames in a video sequence. A CDM is defined
as the color dissimilarity of two frames with respect to a given set of rules.
Considering a stationary camera, consistent objects, and constant lighting conditions,
the pixel-wise color difference of a pair of adjacent frames is an indication of moving
objects in the scene. However, not all color changes are caused by moving
objects. Camera motion, intensity changes and shadows due to non-uniform
lighting across video frames, and image noise also contribute to the frame difference.
Its computational simplicity makes the CDM practical for real-time applications,
see, e.g., C.S. Regazzoni, G. Fabri, and G. Vernazza, "Advanced video-based
surveillance system", Kluwer Academic Pub., 1999. However, using the CDM alone
to determine moving objects yields poor segmentation performance.

Therefore, there is a need for an improved, fully automatic method for precisely 
identifying any number of moving objects in a video, particularly where the object 
has very little motion relative to the background, e.g., a talking head. The method 
should integrate both motion and color features in the video over time. The 
segmentation should happen in a reasonable amount of time, and should depend
neither on an initial user segmentation nor on homogeneous motion constraints.

Summary of the Invention

The present invention provides an automatic method for identifying moving objects 
in a video. The method combines volume growing with change detection. After an
input video is filtered to remove noise, a spatio-temporal data structure is formed
from the video frames, and markers are selected. From the markers, volumes are
grown using a color-similarity-based centroid-linkage method. Change detection
masks are then extracted from adjacent frames in the video using local color features.
The change detection masks are intersected with each volume, to determine the
number of changed pixels only in portions of the masks that lie within that volume. If
the number of changed pixels in the intersection exceeds a threshold, then the
volume is identified as a moving object.

Brief Description of the Drawings

Figure 1 is a block diagram of a method for identifying moving objects in a video
according to the invention;

Figure 2 is a block diagram of a segmenting volumes step of the method of Figure 1;

Figure 3 is a block diagram of an extracting change detection masks step of the
method of Figure 1; and

Figure 4 is a block diagram of an identifying moving objects step of the method of
Figure 1.

Detailed Description of the Preferred Embodiment

The invention identifies moving objects in a video 101 using spatiotemporal volume
growing and change detection masks. The invention is particularly useful to identify
objects in a video that have little motion, such as "talking heads." As shown in
Figure 1, a first step segments 200 volumes 241 from the video 101 by constructing a
spatiotemporal data structure from frames of the video 101. Markers $m_i$ are selected
from the data structure. The markers are starting points for growing the volumes $V_i$
241. A second step extracts 300 change detection masks 341 from the input video
101. The masks are extracted by determining a change in color features of
corresponding pixels in an adjacent pair of frames. In a third step, the extracted
masks 341 are applied to the volumes 241 to identify 400 moving objects 421 in the
video 101.

Segmenting Volumes

Constructing Spatiotemporal Data Structure S 

Figure 2 shows the details of the segmenting step 200 of Figure 1. First, in an
optional preprocessing step, fast median filtering 210 is applied to the video 101 to
remove local irregularities. The next step constructs 220 the spatiotemporal data
structure S 221 from the pixels of the frames of the input video 101. Each element in
the data structure S(x,y,t) is a vector w(x,y,t) that includes color values and change
detection scores of a pixel at a location (x,y,t), where (x,y) are coordinates of the
pixel in a particular frame t of the input video 101.
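
As a rough illustration of the data structure S, the following sketch stacks the frames of a video into a four-dimensional array, so that S[t, y, x] holds the vector w(x,y,t). The function name, the (t, y, x) array layout, and the choice to reserve one trailing channel for the change detection score (initialized to zero) are assumptions made for this example; the description does not prescribe them.

```python
import numpy as np


def build_spatiotemporal_structure(frames):
    """Stack (H, W, 3) color frames into S with shape (T, H, W, 4).

    Channels 0..2 hold the color values (e.g. Y, U, V); channel 3 is a
    placeholder for the change detection score (assumed layout).
    """
    video = np.stack([np.asarray(f, dtype=np.float64) for f in frames])
    scores = np.zeros(video.shape[:3] + (1,))
    return np.concatenate([video, scores], axis=-1)
```

With three frames of 2 x 4 pixels, the resulting array has shape (3, 2, 4, 4), and indexing S[t, y, x] returns the feature vector of one pixel.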

Marker Selecting

A vector with a minimum color gradient magnitude is selected 230 as a marker 231.
The marker 231 is the starting point for growing 240 an unrefined volume 241. In
one preferred embodiment, a YUV color space is used because this color space
performs in accordance with human visual perception, and inter-color distances can
be computed by the magnitude or Euclidean distance norms. Any color space can be
used as long as the inter-color distance formula is adapted accordingly.

The marker is selected 230 by determining which of the vectors 221 has a minimum
color gradient magnitude because vectors with minimum gradient magnitudes best
characterize a uniformly textured local neighborhood of pixels. The color gradient
magnitude $|\nabla S|$ is determined by:

$$
|\nabla S(x,y,t)| = |w_y(x^-,y,t) - w_y(x^+,y,t)| + |w_u(x,y^-,t) - w_u(x,y^+,t)| + |w_v(x,y,t^-) - w_v(x,y,t^+)|, \tag{1}
$$

where $(\cdot)^-$ and $(\cdot)^+$ represent equal distances from a central pixel in the local
neighborhood. For computational simplicity, only the luminance component $w_y$ is
used. Then, the vector with the minimum gradient magnitude is selected 230 as a
marker m 231.
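
The gradient computation and marker selection can be sketched as follows. This is a minimal example, not a definitive implementation: it applies the luminance-only simplification mentioned above to all three directions, takes the "equal distance" to be one sample by default, and replicates border samples at the edges of the video. The function names, the `step` parameter, and the border handling are assumptions.

```python
import numpy as np


def gradient_magnitude(lum, step=1):
    """Gradient magnitude in the style of equation (1), luminance only.

    lum: array of shape (T, H, W). `step` is the distance to the samples
    on either side of the central pixel (assumed default of 1).
    Borders are edge-padded, which is an assumption.
    """
    p = np.pad(lum.astype(np.float64), step, mode="edge")
    mid = slice(step, -step)        # central position
    minus = slice(0, -2 * step)     # sample at distance -step
    plus = slice(2 * step, None)    # sample at distance +step
    gx = np.abs(p[mid, mid, minus] - p[mid, mid, plus])
    gy = np.abs(p[mid, minus, mid] - p[mid, plus, mid])
    gt = np.abs(p[minus, mid, mid] - p[plus, mid, mid])
    return gx + gy + gt


def select_marker(grad, available):
    """Return (t, y, x) of the minimum-gradient vector among available ones."""
    masked = np.where(available, grad, np.inf)
    return tuple(int(i) for i in np.unravel_index(np.argmin(masked), grad.shape))
```

The boolean array `available` marks the vectors that have not yet been assigned to a volume, so the same function can be reused each time a new marker is needed.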

Growing Volumes 

An unrefined volume 241 is grown 240 around the marker 231. A centroid-linkage
method is used for growing volumes 240. The centroid a is the vector $w(m_i)$ of the
marker. An active shell includes all outer boundary vectors $p^+$ of the current volume
241. Adjacent vectors $p^-$ are selected in the 6-neighborhood of an active shell vector
(x,y,t), which includes the vectors (x+1,y,t), (x-1,y,t), (x,y+1,t), (x,y-1,t), (x,y,t+1),
and (x,y,t-1). Vectors $p^-$ adjacent to the active shell are compared to the centroid, and a
color distance $d(a,p^-)$ between the centroid and each adjacent vector $p^-$ is
determined. If the color distance between the centroid and the adjacent vector is less
than a threshold $\epsilon$, then the adjacent vector is included in the unrefined volume, and
the centroid a is updated. To determine the color distance threshold $\epsilon$, the pixels of
the input video 101 are quantized, using dominant colors, by vector clustering in
color space. The quantization improves the robustness of the centroid-linkage method by
simplifying the color spectrum.

When the volume 241 is grown, its vectors are removed from a set Q according to

$$
m_i = \arg\min_{Q} |\nabla S(x,y,t)|; \quad Q = S - \bigcup_{j} V_j, \tag{2}
$$

where Q, initially, is the set of all vectors 221.

Then, the next vector having a minimum gradient magnitude in the remaining set is 
selected as a next marker, and the volume growing process is repeated 235 until no 
more vectors 221 remain. 
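
The growing and marker-iteration steps above can be sketched as follows. This is an illustrative example under stated assumptions: the color distance is taken to be the Euclidean norm, the centroid is updated as a running mean of the included vectors (the description only says that it is updated), and label 0 stands for membership in the remaining set Q. All function and parameter names are hypothetical.

```python
from collections import deque

import numpy as np

# 6-neighborhood offsets in (t, y, x) order.
NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def grow_volume(S, marker, labels, label, eps):
    """Grow one volume from `marker`, writing `label` into `labels` in place."""
    centroid = S[marker].astype(np.float64)
    size = 1
    labels[marker] = label
    shell = deque([marker])                      # active shell
    while shell:
        t, y, x = shell.popleft()
        for dt, dy, dx in NEIGHBORS:
            q = (t + dt, y + dy, x + dx)
            if any(c < 0 or c >= n for c, n in zip(q, labels.shape)):
                continue                         # outside the video
            if labels[q] != 0:
                continue                         # already belongs to a volume
            if np.linalg.norm(S[q] - centroid) < eps:
                labels[q] = label
                size += 1
                # Running-mean centroid update (assumed update rule).
                centroid += (S[q] - centroid) / size
                shell.append(q)
    return size


def segment_volumes(S, grad, eps):
    """Select markers and grow volumes until no unlabeled vector remains."""
    labels = np.zeros(grad.shape, dtype=np.int64)   # 0 means "still in Q"
    label = 0
    while (labels == 0).any():
        masked = np.where(labels == 0, grad, np.inf)
        marker = tuple(int(i) for i in np.unravel_index(np.argmin(masked), grad.shape))
        label += 1
        grow_volume(S, marker, labels, label, eps)
    return labels
```

Here `S` has shape (T, H, W, C) and `grad` has shape (T, H, W). The pure-Python loop is written for readability; it visits every vector at least once and would be slow on full-length videos.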

Volume Merging 

Merging 250 reduces irregularities in the unrefined volumes 241. Volumes that are 
less than a minimum size are merged 250 with adjacent volumes. For example, 
volumes less than 0.001 of the volume V, i.e., the entire video, are merged. To 
accelerate this process, the merging 250 is performed in a hierarchical manner by 
starting with the smallest volume, and ending with the largest volume that does not
satisfy the minimum size requirement. The smallest volume that does not satisfy the 
minimum size requirement is merged with a closest volume. This is repeated for all 
small volumes in an order of increasing size. 
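
A possible sketch of the merging step is shown below. It assumes that the "closest" volume is the adjacent volume whose mean color vector is nearest in Euclidean distance, and it recomputes the volume sizes after every merge; both choices are assumptions, as are the function names. The default `min_fraction` of 0.001 follows the example value given above.

```python
import numpy as np


def adjacent_labels(labels, lab):
    """Labels of volumes that touch volume `lab` in the 6-neighborhood."""
    mask = labels == lab
    found = set()
    for axis in range(3):
        for a_sl, b_sl in ((slice(0, -1), slice(1, None)),
                           (slice(1, None), slice(0, -1))):
            a = [slice(None)] * 3
            b = [slice(None)] * 3
            a[axis], b[axis] = a_sl, b_sl
            # Labels of the neighbors (along `axis`) of the voxels in `lab`.
            touching = labels[tuple(b)][mask[tuple(a)]]
            found.update(np.unique(touching).tolist())
    found.discard(lab)
    return found


def merge_small_volumes(S, labels, min_fraction=0.001):
    """Merge volumes below the minimum size, smallest first."""
    labels = labels.copy()
    min_size = min_fraction * labels.size
    while True:
        ids, sizes = np.unique(labels, return_counts=True)
        small = [(int(s), int(i)) for i, s in zip(ids, sizes) if s < min_size]
        if not small or len(ids) == 1:
            return labels
        _, lab = min(small)                      # smallest undersized volume
        mean = S[labels == lab].mean(axis=0)
        target = min(
            adjacent_labels(labels, lab),
            key=lambda n: np.linalg.norm(S[labels == n].mean(axis=0) - mean),
        )
        labels[labels == lab] = target
```

Each pass removes one label, so the loop terminates once every remaining volume meets the size requirement or only a single volume is left.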

Extracting Change Detection Masks 

Determining Distances 

Figure 3 shows the details of the extracting step 300 of Figure 1. This step extracts
change detection masks from adjacent pairs of frames. First, distances 311 are
determined 310 for a pixel p(x,y,t) in a local window of a current frame t and a pixel
$q_n(x_n,y_n,t-1)$ in an adjacent frame t-1

$$
\delta(p,q_n) = \sum_{i,j} \sum_{k} |w_k(x_i,y_j,t) - w_k(x_{n_i},y_{n_j},t-1)|, \tag{3}
$$

where $x_{n_i}, y_{n_j}$ are coordinates of a pixel around the center pixel $q_n(x_n,y_n,t-1)$ in the
window $N_1$, and k indexes the color components y, u, v. This produces the distances
$\delta(p,q_n)$ 311. The points $q_n(x_n,y_n,t-1)$ are chosen in another window $N_2$. The color
components can be chosen from any color space, e.g., RGB, HIS, etc. In case a single
channel input is used, k represents that single channel, i.e., gray level.

Selecting Minimum Scores

Selecting minimum scores 320 prevents minor errors in motion estimation. The
minimum of the distances $\delta(p,q_n)$ 311 in another window $N_2$ is assigned as the score
$\Delta(p)$ 321 of each pixel p according to

$$
\Delta(p) = \min_{n} \delta(p,q_n), \quad q_n \in N_2. \tag{4}
$$
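
The distance and minimum-score steps can be sketched together as follows. The example treats both windows as square, with `r1` the half-width of the comparison window and `r2` the half-width of the search window in the previous frame, and replicates border pixels at the frame edges. These parameter names, the square window shapes, and the border handling are assumptions made for illustration.

```python
import numpy as np


def box_sum(a, r):
    """Sum of `a` over a (2r+1) x (2r+1) window around each pixel."""
    p = np.pad(a, r, mode="edge")
    h, w = a.shape
    out = np.zeros((h, w))
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += p[dy:dy + h, dx:dx + w]
    return out


def min_match_scores(cur, prev, r1=1, r2=1):
    """Per-pixel score: minimum windowed color distance over the search window.

    cur, prev: frames of shape (H, W, C) at times t and t-1.
    """
    cur = cur.astype(np.float64)
    h, w, _ = cur.shape
    prev_p = np.pad(prev.astype(np.float64),
                    ((r2, r2), (r2, r2), (0, 0)), mode="edge")
    scores = np.full((h, w), np.inf)
    for dy in range(2 * r2 + 1):          # candidate points in the search window
        for dx in range(2 * r2 + 1):
            shifted = prev_p[dy:dy + h, dx:dx + w]
            diff = np.abs(cur - shifted).sum(axis=2)               # sum over k
            scores = np.minimum(scores, box_sum(diff, r1))         # sum over i, j, then min
    return scores
```

Because the minimum is taken over small displacements, a frame that is merely shifted by one pixel yields scores of zero away from the border, which illustrates how the step tolerates minor motion estimation errors.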

Averaging Scores 

Averaging scores 330 in window $N_3$ produces averaged scores 331 for thresholding
340 to produce change detection masks 341.

Thresholding Scores

Thresholding scores 340 produces binary change detection masks cdm(p) 341,

$$
cdm(p) = \begin{cases} 1 & \bar{\Delta}(p) > \mu \\ 0 & \text{else,} \end{cases} \tag{5}
$$

where $\bar{\Delta}(p)$ is the averaged score 331 of pixel p and $\mu$ is a threshold. It can be
assigned as the weighted average of the dynamic ranges of the color components. The
score threshold is chosen such that the average scores 331 correspond to a cluster of
changed points instead of single points. Small regions are filtered in this way.
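
The averaging and thresholding steps can be sketched as follows. The window is assumed to be square with half-width `r3`, border pixels are replicated, and the threshold is passed in as `mu`; these names and choices are hypothetical and only illustrate how averaging suppresses isolated changed points.

```python
import numpy as np


def box_mean(a, r):
    """Mean of `a` over a (2r+1) x (2r+1) window around each pixel."""
    p = np.pad(a.astype(np.float64), r, mode="edge")
    h, w = a.shape
    out = np.zeros((h, w))
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += p[dy:dy + h, dx:dx + w]
    return out / (2 * r + 1) ** 2


def threshold_scores(scores, r3, mu):
    """Binary change detection mask: 1 where the averaged score exceeds mu."""
    return (box_mean(scores, r3) > mu).astype(np.uint8)
```

With a 3 x 3 window, a single pixel of score 9 averages to 1 and is rejected by a threshold of 2, whereas the center of a 3 x 3 cluster of the same score averages to 9 and is kept.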

Alternative Change Detection Masks

Other change detection masks can be used instead of the above method. These masks
include, but are not limited to, frame difference operators, global motion
compensated masks, and non-binary change detection masks. Therefore, the method
explained in the disclosure covers all change detection mask extraction methods. A
simple change detection mask may be

$$
cdm(p) = \sum_{k} |w_k(x,y,t) - w_k(x,y,t-1)|, \tag{6}
$$

where pixel p is the pixel (x,y) in the frame t, and k represents the color components.
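
The simple frame-difference mask amounts to one line of array arithmetic. In this sketch the function name is hypothetical, and the frames are converted to floating point first so that unsigned 8-bit inputs do not wrap around on subtraction.

```python
import numpy as np


def frame_difference_mask(cur, prev):
    """Non-binary mask: sum over color components of absolute differences.

    cur, prev: frames of shape (H, W, C) at times t and t-1.
    """
    diff = cur.astype(np.float64) - prev.astype(np.float64)
    return np.abs(diff).sum(axis=-1)
```

The result is non-binary; a binary mask could be obtained by comparing it against a threshold.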

Applying Change Detection Masks to the Segmented Volumes 

Figure 4 shows the details of the identifying step 400 of Figure 1. After segmenting
200 volumes and extracting 300 masks, moving objects are identified 400. For each
volume, the number of changed pixels is counted only in the portions of the masks that
intersect the volume. The total counts can be normalized, and volumes having
counts exceeding a predetermined threshold are identified as moving objects 421.
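
The identification step can be sketched as follows. The example assumes that the binary masks are stacked into an array with the same (T, H, W) shape as the volume labels, with an all-zero mask for the first frame, and that the count is normalized by the size of the volume. The normalization rule, the function name, and the array layout are assumptions.

```python
import numpy as np


def identify_moving_volumes(labels, masks, threshold):
    """Return labels of volumes whose fraction of changed pixels exceeds threshold.

    labels: integer volume labels of shape (T, H, W).
    masks:  binary change detection masks of the same shape.
    """
    moving = []
    for lab in np.unique(labels):
        inside = labels == lab                 # intersection of masks with the volume
        changed = masks[inside].sum()          # changed pixels within the volume
        if changed / inside.sum() > threshold:
            moving.append(int(lab))
    return moving
```

A static volume such as a blank wall collects few changed pixels and is rejected, while a volume covering a slightly moving head accumulates changes over many frames.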

The present invention can accurately identify a moving object in a video, particularly
where the object has very little motion associated with it, e.g., a head-and-shoulders
type video sequence. The method uses both motion and color features over time. The
identification occurs in a reasonable amount of time, and is not dependent on an
initial user segmentation, nor on homogeneous motion constraints. Identified moving
objects can now readily be segmented.

A further advantage of the invention is that it does not require background
registration. In addition, the invention can extract object boundaries accurately
without using snake-based models or boundary correction methods. The presented
method can also segment moving objects that are smoothly textured.

Although the invention has been described by way of examples of preferred 
embodiments, it is to be understood that various other adaptations and modifications 
may be made within the spirit and scope of the invention. Therefore, it is the object 
of the appended claims to cover all such variations and modifications as come within 
the true spirit and scope of the invention. 